1 Introduction

For this project the data from the “VII Encuesta de Presupuestos Familiares” (VII Household Budget Survey) was selected. This is a survey done every 2 to 3 years in Chile, that contains data about household’s inhabitants, some of their social characteristics as income, education, gender, age, etc., and the expenses each household have within a month. The data is splited in two data sets, one with the inhabitant description and income variables, with some groupping categories by households (households); the other with the expenses reported in several categories by household id (expenses).

Some data wrangling will be needed before starting some of the exploratory data analysis, given that the households data set have one entry by each household member, while the expenses data set contains the expenses only by household not separated by household inhabitant: this means that is possible to merge the data by matching the household ids, but not by individuals (because that was the intended use for the data).

2 Loading libraries and data set

The following libraries were used for this work:

The data is stored in RData files, after being transformed from SPSS data sets.

load("households.RData")
load("expenses.RData")

Some cleaning is still needed, for example there two negative ages and some households without a total income reported, because of missing data.

households <- subset(households, age > 0 & !is.na(income.hh.av.rent))

3 Univariate Analysis.

Let’s start with some simple explorations of the population in our data set. From the variables descriptions we decided to focus on the following variables:

What is the population’s age distribution? The summary function can give us a start:

summary(households[,'age'])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0    17.0    32.0    34.9    51.0   103.0

The next figure shows an histogram using the age variable (a discrete numerical variable). The binwidths are equal to 1 year. The figure shows that the population is not normally distributed and positevely skewed overall. This is expected, since the population must decrese with age as people dies by accidents, illness or natural causes.

However is interesting to notice some peaks at around 5, 25 and 50 years of age: they might correspond to generations with higher natallity rates or less infant mortality.

What about the education attainment distribution?

Let’s change the x scale so we can see more details on the distribution if the household’s income.

decil <- quantile(houseinc$income.hh.av.rent, probs = seq(0, 1, 0.1), na.rm = T)

decil[1] <- 0 
decil[11] <- decil[11] + 100

households$income.dec <- cut(households$income.hh.av.rent, decil, right = F)
houseinc$income.dec <- cut(houseinc$income.hh.av.rent, decil, right = F)

levels(houseinc$income.dec) <- c(
  "US$3 to US$419", "US$420 to US$575", "US$576 to US$738", "US$739 to US$907",
  "US$908 to US$1109", "US$1110 to US$1369", "US$1370 to US$1709", 
  "US$1710 to US$2309", "US$2310 to US$3609", "US$3610 and Up" )

levels(households$income.dec) <- levels(houseinc$income.dec)

indec <- subset(houseinc, !is.na(income.dec)) %>% 
  group_by(income.dec) %>% 
  summarise(Total = sum(income.hh.av.rent, na.rm = T))

So, while the mean household income is US$1707 the median is at US$1110. How it does compare with other countries?

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##     2.96   653.40  1110.00  1707.00  1976.00 53280.00

summary(inhab$num.inhabitants)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   3.000   3.389   4.000  15.000
ord <- as.character(
  arrange(data.frame(table(households$edependence)), desc(Freq))$Var1)

ggplot(aes(factor(edependence, levels = ord)), data = households) +
  geom_histogram(fill = "#67a9cf") + 
  theme(axis.text.x = element_text(angle = 40, hjust = 1)) +
  xlab('Educational Institution Type (private/public)') +
  ylab('Frequency')

ggplot(aes(factor(edependence, levels = ord),
           y = 100 * ..count.. / sum(..count..)),
       data = households[households$edependence %in% ord[c(3,2,5)], ]) +
  geom_histogram(fill = "#67a9cf") +
  theme(axis.text.x = element_text(angle = 40, hjust = 1)) +
  xlab('Primary Education Type (private/public)') +
  ylab('Percentage')

summary(subset(households, !is.na(health.exp))$health.exp)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##    2.402   27.240   44.360   80.980   91.500 2279.000
ggplot(aes(health.exp), data = subset(households, !is.na(health.exp))) +
  geom_histogram(fill = "#67a9cf", binwidth = 10) +
  xlab('Health Expenditure (US$)') +
  ylab('Frequency')

ggplot(aes(health.exp), data = subset(households, !is.na(health.exp))) +
  geom_histogram(fill = "#67a9cf", binwidth = 0.2) +
  scale_x_continuous(trans = "log1p", breaks = c(5,50,100,200,1000,2000)) +
  xlab('Health Expenditure (US$)') +
  ylab('Frequency')

4 Bivariate Analysis.

We can arrange a little bit more this plot and create a population pyramid:

The peaks seem to change for each gender! We can also notice that there are more women (53.2%) than men (46.8%). Are the gender’s average age different?

We see a difference in the average age for both gender, with males having an overall younger population.

  Min. 1st Qu. Median Mean 3rd Qu. Max.
Men 1 16 30 33.49 50 101
Women 1 18 34 36.15 52 103

Let’s test if the difference in statistical significant by using the Wilcoxon Rank Test:

Wilcoxon rank sum test with continuity correction: age by gender
Test statistic P value Alternative hypothesis
143282477 2.947e-28 * * * two.sided

What about the education attainment distribution? We will study the distribution using an stacked histogram, so we can study at the same time if there are any significant differences between genders.

We see a pick in the distribution at category 5, which correspond to the primary education. This doesn’t mean that most of the population only reach primary school: we have not removed the school-age population. We could use two variables to subset the data and include only the population that is no longer studying:

As shown in the Introduction

## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.

## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.
## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.

## Warning: Removed 45 rows containing non-finite values (stat_boxplot).
## Warning: Removed 45 rows containing missing values (geom_point).

What are households expending on?

tabex <- expenses %>% group_by(description) %>% summarise(expend = sum(expense))
usar <- as.character(
  as.data.frame(tabex[order(-tabex$expend)[2:33][-17][-14], ])[, 1])

highexp <- subset(expenses, description %in% usar)

tabexp <- expenses %>% group_by(description) %>% summarise(co = n())
usar1 <- as.character(
  as.data.frame(tabexp[order(-tabexp$co)[1:31][-5], ])[, 1])
moreexp <- subset(expenses, description %in% usar1)

tabexp2 <- highexp %>% group_by(description) %>% summarise(med = median(expense))
usar2 <- as.character(as.data.frame(tabexp2[order(-tabexp2$med), ])[, 1])



ggplot(aes(factor(description, levels = usar2), expense), data = highexp) + 
  geom_jitter(alpha = 0.3, color = "#67a9cf") + 
  geom_boxplot(alpha = 0.8) + 
  scale_y_log10() + 
  scale_x_discrete(labels = 1:30) +
  xlab('Expenditure Category') +
  ylab('Expenditure Amount (US$)')

tabexp3 <- moreexp %>% group_by(description) %>% summarise(med = median(expense))
usar3 <- as.character(as.data.frame(tabexp3[order(-tabexp3$med), ])[, 1])

ggplot(aes(factor(description, levels = usar3), expense), data = moreexp) + 
  geom_jitter(alpha = 0.3, color = "#67a9cf") + 
  geom_boxplot(alpha = 0.8) + 
  scale_y_log10(limits = c(0.1, 10000)) + 
  scale_x_discrete(labels = 1:30) +
  xlab('Expenditure Category') +
  ylab('Expenditure Amount (US$)')
## Warning: Removed 3 rows containing non-finite values (stat_boxplot).
## Warning: Removed 3 rows containing missing values (geom_point).

households$age.group <- cut(households$age, seq(15, 81, 5), right = F)


ggplot(aes(age.group),
       data = subset(households, !is.na(dep.work.income) & !is.na(age.group))) +
  geom_boxplot(aes(y = dep.work.income, fill = gender)) + 
  scale_y_continuous(trans = 'log1p') + 
  scale_fill_brewer(palette = "Paired")

t1 <- dcast(subset(households, !is.na(dep.work.income)),
            age.group ~ gender,
            fun = median,
            value.var = 'dep.work.income')

w <- households %>%
  subset(!is.na(dep.work.income)) %>% 
  group_by(age.group, gender) %>% 
  summarise(inco = median(dep.work.income))

ggplot(aes(age.group), data = subset(w, !is.na(age.group))) + 
  geom_histogram(aes(y = inco, fill = gender), stat = 'identity',
                 alpha = 0.5, position = "identity", color = 'grey') + 
  scale_fill_brewer(palette = "Paired")

t <- 
  left_join(
    expenses, 
    subset(households, person.id == 1,
           select = c(num.inhabitants, income.dec, home.id)), 
    by = 'home.id') %>% 
  subset(!is.na(income.dec))

expsum <- 
  t %>% 
  group_by(income.dec, d) %>% 
  summarise(total = sum(expense), median = median(expense), quant = n())

ggplot(aes(d, y = total), data = expsum) + 
  geom_bar(aes(fill = d), stat = 'identity') +
  scale_x_discrete(labels = '') +
  scale_fill_brewer(palette = "Paired") +
  facet_grid( ~ income.dec)

ggplot(aes(income.dec, y = median), data = expsum) + 
  geom_bar(aes(fill = income.dec), stat = 'identity') +
  scale_x_discrete(labels = '') +
  scale_fill_brewer(palette = "Paired") +
  facet_grid( ~ d)

expsum <- 
  expsum %>% 
  group_by(income.dec) %>% 
  mutate(suma = sum(total))

ggplot(aes(income.dec, y = 100 * total / suma), data = expsum) + 
  geom_bar(aes(fill = income.dec), stat = 'identity') + 
  scale_x_discrete(labels = '') +
  scale_fill_brewer(palette = "Paired") +
  facet_grid( ~ d)

D Code Description
01 Food and non-alcoholic beverages
02 Alchoholic beverages, tobacco and narcotics
03 Clothing and footwear
04 Housing, water, electricity, gas and other fuels
05 Furnishings, household equipment and routine household maintenance
06 Health
07 Transport
08 Communication
09 Recreation and culture
10 Education
11 Restaurants and hotels
12 Miscellaneous goods and services